Computing correlated aggregates over a data stream

ABSTRACT

Described herein are approaches for computing correlated aggregates. An aspect provides for receiving a stream of data elements at a device, each data element having at least one numerical attribute; maintaining in memory a plurality of tree structures comprising a plurality of separate nodes for summarizing numerical attributes of the data elements with respect to a predicate value of a correlated aggregation query, said maintaining comprising: creating the plurality of tree structures in which each node implements one of: a probabilistic counter and a sketch, wherein said probabilistic counter and said sketch each act to estimate aggregated data element numerical attributes to form a summary of said numerical attributes; and responsive to a correlated aggregation query specifying said predicate value, using said plurality of tree structures as a summary of said data element numerical attributes to compute a response to said correlated aggregate query.

CROSS REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. patent application Ser. No. 13/278,469, entitled COMPUTING CORRELATED AGGREGATES OVER A DATA STREAM, filed on Oct. 21, 2011, which is incorporated by reference in its entirety.

STATEMENT OF GOVERNMENT INTEREST

This invention was made with government support under grant numbers CNS0834743 and CNS0831903, awarded by the National Science Foundation. The government has certain rights in the invention.

FIELD OF THE INVENTION

The subject matter presented herein generally relates to efficiently computing correlated aggregates over a data stream.

BACKGROUND

Processing a massive data set presented as a large data stream is challenging. The data set may be, for example, network traffic data or streaming data from external memory. A difficulty encountered is that the data set is too large to store in cache for analysis. Because of the size of the data set, processing requires fast update time, use of very limited storage, and possibly only a single pass over the data.

In computing statistics for certain types of data set analysis, a problem is the correlated aggregate query problem. Here, unlike in traditional data streams, there is a stream of two dimensional data items (i, y), where i is an item identifier, and y is a numerical attribute. A correlated aggregate query requires first applying a selection predicate along the y dimension, followed by an aggregation along the first dimension. An example of a correlated aggregate query is: “On a stream of IP packets, compute the k-th frequency moment of all the source IP address fields among packets whose length was more than 100 bytes”. Answering such a query may, for example, allow a network administrator to identify IP addresses using an excessive amount of network bandwidth.

BRIEF SUMMARY

One aspect provides a method for summarizing attributes of data elements of a data set for computing correlated aggregates, comprising: receiving a stream of data elements at a device, each data element having at least one numerical attribute; maintaining in memory a plurality of tree structures comprising a plurality of separate nodes for summarizing numerical attributes of the data elements with respect to a predicate value of a correlated aggregation query, said maintaining comprising: creating the plurality of tree structures in which each node implements one of: a probabilistic counter and a sketch, wherein said probabilistic counter and said sketch each act to estimate aggregated data element numerical attributes to form a summary of said numerical attributes; and responsive to a correlated aggregation query specifying said predicate value, using said plurality of tree structures as a summary of said data element numerical attributes to compute a response to said correlated aggregate query.

The foregoing is a summary and thus may contain simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting.

For a better understanding of the embodiments, together with other and further features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying drawings. The scope of the invention will be pointed out in the appended claims.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 illustrates an example operating environment for correlated aggregation queries.

FIGS. 2-4 illustrate an example of processing to provide an estimate for a COUNT correlated aggregation.

FIGS. 5-7 illustrate an example of processing to provide an estimate for general correlated aggregations.

FIG. 8 illustrates examples of sketch space v. relative error for example data sets.

FIG. 9 illustrates examples of sketch space v. stream size for example data sets.

FIG. 10 illustrates examples of sketch space v. stream size for example data sets.

FIG. 11 illustrates examples of sketch space v. stream size for example data sets.

FIG. 12 illustrates examples of sketch space v. relative error for example data sets.

FIG. 13 illustrates examples of sketch space v. stream size for example data sets.

FIG. 14 illustrates an example computer system.

DETAILED DESCRIPTION

It will be readily understood that the components of the embodiments, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations in addition to the described example embodiments. Thus, the following more detailed description of the example embodiments, as represented in the figures, is not intended to limit the scope of the claims, but is merely representative of those embodiments.

Reference throughout this specification to “embodiment(s)” (or the like) means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases “according to embodiments” or “an embodiment” (or the like) in various places throughout this specification are not necessarily all referring to the same embodiment.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in different embodiments. In the following description, numerous specific details are provided to give a thorough understanding of example embodiments. One skilled in the relevant art will recognize, however, that aspects can be practiced without certain specific details, or with other methods, components, materials, et cetera. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obfuscation.

On a stream of two dimensional data items (i, y), where i is an item identifier and y is a numerical attribute, a correlated aggregate query requires applying a selection predicate first along the second dimension, y, followed by an aggregation along the first dimension. For selection predicates of the form (y&lt;c) or (y&gt;c), where the parameter c is provided at query time, embodiments provide streaming processing and lower bounds for estimating statistics of the resulting substream of elements that satisfy the predicate.

For example, an embodiment can answer the following query: “On a stream of IP packets, compute the k-th frequency moment F_k (k=0, 1, 2 or greater) of all the source IP address fields among packets whose length was more than 100 bytes”. A process according to an embodiment for F₁ (basic count) improves previous approaches by roughly a logarithmic factor and is near-optimal. An approach provided according to embodiments for k≠1 provides the first sub-linear space algorithms in this model for these problems. Embodiments also estimate heavy hitters, rarity, and similarity in this model. The memory requirements are significantly smaller than existing linear storage schemes for large data sets, and embodiments simultaneously achieve fast per-record processing time.

The description now turns to the figures. The illustrated example embodiments will be best understood by reference to the figures. The following description is intended only by way of example and simply illustrates certain example embodiments representative of the invention, as claimed.

Referring to FIG. 1, consider a data stream (data set/stream are used interchangeably herein) of tuples S=(x_i, y_i), i=1 . . . n, where x_i is an item identifier, and y_i a numerical attribute. A correlated aggregate query C(σ, AGG, S) specifies a selection predicate σ along the y dimension and an aggregation function AGG along the x dimension. It requires that the selection be applied first on stream S, followed by the aggregation. More formally,

C(σ, AGG, S)=AGG{x_i | σ(y_i) is true}

Importantly, the selection predicate σ need only be completely specified at query time, and may not be known when the stream is being observed. For example, on a stream of IP packets, with x denoting the source IP address, and y denoting the number of bytes in the packet, on an observed packet stream, a network administrator may ask the query: “What is the second frequency moment (F₂) of all IP addresses for those packets that are less than 1000 bytes”? The selection predicate σ=(y&lt;1000) is completely specified only at query time, giving the network administrator flexibility to perform further queries based on the result of previous queries. For example, if the answer for the above query was “interesting” (however defined), the network administrator may then refine the next query further by asking for the same aggregate, but giving a different selection predicate, for example σ=(y&lt;500). Note that with this framework, the network administrator could form the query: “What is the F₂ of all IP addresses for those packets whose packet size is smaller than the mean packet size of the stream”, by first determining the mean packet size, say μ, and then using an aggregate query with selection predicate y&lt;μ.

A traditional data stream summary provides an accurate answer to an aggregate query, such as frequency moments, quantiles, and heavy-hitters, over the entire data stream. In contrast, a summary for a correlated aggregate provides accurate answers to a family of queries that can be posed on the stream, where different queries in the family specify the same aggregation function, but over different subsets of the input stream. These subsets are selected by the predicate along the primary dimension. Correlated aggregates arise naturally in analytics on multi-dimensional streaming data, and space-efficient methods for implementing such aggregates are useful in streaming analytics systems such as IBM SYSTEM S.

The summaries for correlated aggregates provided by embodiments allow for more flexible interrogation of the data stream than is possible with a traditional stream summary. For example, consider a data stream of IP flow records, such as those output by network routers equipped with CISCO netflow. Suppose that one is interested in only two attributes per flow record, the destination address of the flow, and the size (number of bytes) of the flow. Using a summary for correlated aggregate AGG, along with a quantile summary for the y dimension (any well known stream quantile summary may be used), it is possible for a network administrator to execute the following sequence of queries on the stream:

(1) First, the quantile summary can be queried to find the median size of the flow. (2) Next, using the summary for correlated aggregates, the network administrator can query the aggregate AGG of all those flow records whose flow size was more than the median flow size. (3) If the answer to query (2) was abnormal in the network administrator's opinion, and the network administrator needed to find properties of the very high volume flows, this may be accomplished by querying for the aggregate of all those flow records whose flow size is in, for example, the 95 percent quantile or above (the 95 percent quantile may be found using the quantile summary).

Thus, a summary that allows for such queries can be used for deeper investigation of larger data streams. Important to this is the flexibility that the parameter for the selection predicate can be specified at query time, which allows the user to iteratively adapt his or her query based on the results of previous queries. For data-intensive analytics systems, such as network management systems and sensor data management, this can be a very powerful tool.

In this description, both processing and lower bounds for summarizing data streams for correlated aggregate queries are provided. Embodiments consider selection predicates of the form (y≦y₀) or (y≧y₀), where y₀ is specified at query time.

Estimation of Basic Count (F₁)

For estimating the basic count, the x values of the stream elements (x, y) do not matter, hence only the stream Y=y₁, y₂, . . . , y_n is considered. The y_i's may not be distinct, and do not necessarily arrive in increasing order. The problem is to design a summary that can answer the following query: given a parameter y provided at query time, how many y_i's are less than or equal to y? In other words, estimate COUNT(y)=|{y_i∈Y | y_i≦y}|. Given user parameters ε, 0&lt;ε&lt;1, and δ, 0&lt;δ&lt;1, an embodiment seeks an answer that is within an ε relative error of the exact answer with probability at least 1−δ.
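For concreteness, the query semantics can be pinned down by a naive exact method that simply stores every y value; it uses linear space and serves only as a baseline that the summary developed below is designed to avoid. This is a minimal illustrative sketch in Python, not part of the claimed embodiments; the class and method names are invented for the example.

```python
import bisect

class ExactCorrelatedCount:
    """Naive baseline: answers COUNT(y) exactly by storing all y values."""
    def __init__(self):
        self.ys = []  # all observed y values, kept sorted

    def update(self, y):
        bisect.insort(self.ys, y)

    def count(self, y):
        # number of stream elements with y_i <= y
        return bisect.bisect_right(self.ys, y)

ecc = ExactCorrelatedCount()
for y in [5, 1, 9, 3, 3]:
    ecc.update(y)
print(ecc.count(4))  # prints 3 (the elements 1, 3, 3)
```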

The starting point for a solution is the “splittable histogram” data structure of prior work. This data structure stores histograms at multiple levels, i=0, 1, 2, . . . , such that the error due to using a data structure at level i is proportional to 2^i, but as i increases, the set of possible queries that can be answered by the data structure at level i also becomes larger. The rough intuition here is that if a data structure is used at a large level (a large value of i), though the error is large (proportional to 2^i), the answer must also be large, hence the relative error is controlled.

It is noted that the data structures in prior work use many counters that may need to hold very large values. Observe that these counts need not be exact, and hence an embodiment uses probabilistic counters, which provide approximate counts of large numbers using significantly smaller space than exact counters. The error in the counters and the error in the data structures combine to give the total error in the estimate.

There is a technical difficulty in getting the above idea to work, which is that the states of the counters are used to trigger certain actions of the data structures, such as the formation of new nodes in the data structure. Replacing these counters with probabilistic counters leads to dealing with random variables and events that are not all independent of each other, and this dependency has to be handled carefully. Presented herein are further details of the process for doing this and the analysis.

Probabilistic Counters

A probabilistic counter is a data structure for approximate counting that supports two operations: increment and query. For positive integer M and γ&gt;0, let PC(M, γ) denote a probabilistic counter with base (1+γ), that counts until a maximum estimated value of M. Let C_n denote the state of the counter after n increment operations. Using C_n, it is possible to return an estimate n̂ of n. If n̂≧M, the counter cannot be incremented any further, and the counter is said to be “closed”. If n̂&lt;M, then the counter is said to be “open”.
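One standard realization of such a counter is the Morris-style counter, which stores only a small exponent and increments it probabilistically. The following Python sketch (the class layout is an illustrative assumption, not the patented process) matches the PC(M, γ) interface described above.

```python
import random

class PC:
    """Morris-style probabilistic counter with base (1 + gamma).

    Only the small integer state c is stored, so the space is roughly
    log log M + log(1/gamma) bits, as in Theorem 1 below.
    """
    def __init__(self, M, gamma):
        self.M, self.gamma, self.c = M, gamma, 0

    def estimate(self):
        # unbiased estimate of the number of increments so far
        return ((1 + self.gamma) ** self.c - 1) / self.gamma

    def closed(self):
        return self.estimate() >= self.M

    def increment(self):
        # advance the stored state with probability (1 + gamma)^(-c)
        if not self.closed() and random.random() < (1 + self.gamma) ** (-self.c):
            self.c += 1
```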

Theorem 1: Let n̂ denote the estimated count returned by the counter PC(M, γ) after n increments. Then E[n̂]=n, and

$\mathrm{VAR}[\hat{n}] = \frac{\gamma\, n(n+1)}{2}.$

The counter can be stored in space no more than

$\log\log M + \log\frac{1}{\gamma}$

bits.

The above theorem concerns the estimates returned by the probabilistic counter. The number of increments needed for the counter to reach its maximum value also needs to be analyzed, so a slightly different type of guarantee is required.

Lemma 1: For C=PC(M, γ), let I(C) denote the number of insertions into C that cause it to be closed. Then E[I(C)]=M+1, and

$\mathrm{VAR}[\mathcal{I}(C)] = \frac{\gamma\, M(M+1)}{2}.$

The proof of this lemma is in the appendix.

Data Structure

The data structure consists of histograms S₀, S₁, . . . , S_{l_max}, where l_max=log(2εU), and U is an upper bound on the value of COUNT(y) for any value of y. Each histogram S_i consists of no more than

$\alpha = \frac{4(\log y_{\max} + 1)}{\varepsilon}$

buckets.

Each bucket b∈S_i is a probabilistic counter

$PC\left(2^{i} - 1, \frac{\varepsilon^{2}\delta}{\alpha}\right).$

The buckets form a rooted binary tree, where each internal node has exactly two children. The tree in S_i is stored within an array by keeping the nodes of the tree in an in-order traversal. This is unambiguous, by storing within each bucket an additional bit that says whether the bucket is a leaf in the tree or an internal node.

Every bucket is responsible for a range of y values, which is represented implicitly by the position of the bucket in the tree. The “span” of a bucket b is the range of y values that the bucket b is responsible for. The span of the root of the tree is [0, y_max]. If the span of node ν is [y₁, y₂], then the span of its left child is [y₁, (y₁+y₂−1)/2], and the span of its right child is [(y₁+y₂+1)/2, y₂]. In a given level l, the spans of all buckets at the same depth in the tree are disjoint, but the spans of buckets at different depths may overlap with each other. If they do overlap, then the span of one of the buckets must be a proper subset of the span of the other (overlapping) bucket. Each bucket has a probabilistic counter associated with it.
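The span recursion can be made concrete with a few lines of Python; the helper below (an illustrative sketch, assuming y_max is of the form 2^β−1) walks from the root span down to the leaf containing a given y, producing exactly the nested dyadic ranges described above.

```python
def span_path(y, y_max):
    """Yield the spans of the buckets on the root-to-leaf path containing y."""
    lo, hi = 0, y_max
    while True:
        yield (lo, hi)
        if lo == hi:               # reached the singleton (leaf) span
            return
        mid = (lo + hi - 1) // 2   # left child spans [lo, mid]
        if y <= mid:
            hi = mid
        else:
            lo = mid + 1           # right child spans [mid+1, hi]

print(list(span_path(5, 7)))  # [(0, 7), (4, 7), (4, 5), (5, 5)]
```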

For bucket b, let span(b) denote the span of the bucket. Let left(b) denote the leftmost endpoint of span(b), and right(b) the rightmost endpoint of span(b). Let PC(b) denote the counter associated with the bucket. The process for COUNT is illustrated in FIGS. 2-4.

Lemma 2: The space complexity of the data structure for F₁ is

$O\left( {\frac{1}{\varepsilon}\left( {\log \; \varepsilon \; U} \right)\left( {\log \; y_{\max}} \right)\left( {{\log \; \log \; U} + {\log \frac{1}{\varepsilon}} + {\log \frac{1}{\delta}}} \right)} \right)$

bits, where U is an upper bound on the value of COUNT(y). Iflog(y_(max))=Θ(log(U)), then for constant ε and δ, the space complexityis O(log² U log log U).

Proof: The space complexity is the space taken to store O(l_max·α) buckets. Each bucket b consists of a probabilistic counter PC(M, ε²δ/α), where the value of M is no more than U. The space complexity of the data structure follows from Theorem 1.

Definition 1: For l=0 . . . l_max, level l is defined to be incomplete for query y if T_l&lt;y. If T_l≧y, then level l is said to be complete for query y.

For bucket b, let A(b) denote the number of stream items that were inserted into this bucket. Let EST(b) denote the current estimate of the number of items that were inserted into the bucket. This notation is extended to sets of buckets too. For any set of buckets B, all in the same level, let

$A(B) = \sum_{b \in B} A(b),$

and let

$\mathrm{EST}(B) = \sum_{b \in B} \mathrm{EST}(b).$

Lemma 3: Let I_l be any set of internal nodes in S_l. Then EST(I_l)=2^l|I_l|, E[A(I_l)]=2^l|I_l|, and

$\Pr\left[\left|A(I_l) - \mathrm{EST}(I_l)\right| > \varepsilon\,\mathrm{EST}(I_l)\right] < \frac{\delta}{2|I_l|}.$

Proof: For any b∈I_l, EST(b)=2^l. Since

$\mathrm{EST}(I_l) = \sum_{b \in I_l} \mathrm{EST}(b),$

it follows that EST(I_l)=2^l|I_l|. Further, A(b)=I(PC(2^l−1, γ)). From Lemma 1, E[A(b)]=2^l, and

$\mathrm{VAR}[A(b)] = \frac{\gamma\, 2^{l}(2^{l}-1)}{2}.$

$A(I_l) = \sum_{b \in I_l} A(b)$

Using linearity of expectation yields:

${E\left\lbrack {A\left( I_{} \right)} \right\rbrack} = {{\sum\limits_{b \in I_{}}^{\;}{E\left\lbrack {A(b)} \right\rbrack}} = {2^{}{I_{}}}}$

For each bucket b∈I_(l),

$\Pr\left[\left|A(b) - 2^{l}\right| > \varepsilon 2^{l}\right] \leq \frac{\mathrm{VAR}[A(b)]}{\varepsilon^{2} 2^{2l}} \leq \frac{\gamma}{2\varepsilon^{2}}$

Using a union bound over all the buckets, and using

$\gamma = \frac{\varepsilon^{2}\delta}{\alpha},$

the desired result is obtained.

Lemma 4: Let L be any set of leaves in S_l. Then E[EST(L)]=A(L), and

Pr[|EST(L)−A(L)|>εA(L)]<δ

Proof: For any b∈L, E[EST(b)]=A(b), from Theorem 1. Using linearity of expectation,

$E[\mathrm{EST}(L)] = E\Big[\sum_{b \in L} \mathrm{EST}(b)\Big] = \sum_{b \in L} E[\mathrm{EST}(b)] = \sum_{b \in L} A(b) = A(L).$

For the leaves, after the actual counts A(b) have been assigned, the random variables EST(b) are all independent. Hence:

$\mathrm{VAR}[\mathrm{EST}(L)] = \sum_{b \in L} \mathrm{VAR}[\mathrm{EST}(b)] = \sum_{b \in L} \frac{\gamma\, A(b)(A(b)+1)}{2} = \frac{\gamma}{2}\Big(A(L) + \sum_{b \in L} A^{2}(b)\Big) \leq \frac{\gamma}{2}\left(A(L) + A^{2}(L)\right) \leq \gamma\, A^{2}(L)$

Applying Chebyshev's inequality:

${\Pr \left\lbrack {{{{{EST}(L)} - {A(L)}}} > {\varepsilon \; {A(L)}}} \right\rbrack} \leq \frac{{VAR}\left\lbrack {{EST}(L)} \right\rbrack}{\varepsilon^{2}{A^{2}(L)}} \leq \frac{\gamma}{\varepsilon^{2}} < \delta$

Lemma 5: If level l is incomplete for query y, then with probability at least

$1 - \frac{\delta}{\alpha - 1}:$ COUNT(y)≧(1−ε)2^{l−1}(α−1).

Proof: Since T_l&lt;y, for each bucket b∈S_l, it must be true that right(b)&lt;y. Thus each item that was inserted into a bucket b in S_l must be counted by query y. Let S_l^I denote the internal nodes in S_l.

COUNT(y)≧A(S_l)≧A(S_l^I)

Using Lemma 3, the following is true with probability at least

$1 - \frac{\delta}{2|S_l^{I}|}:$ A(S_l^I)≧(1−ε)EST(S_l^I)=(1−ε)2^l|S_l^I|

Since S_l is a binary tree with α nodes, there are (α−1)/2 nodes in S_l^I. Thus,

$\Pr\left[\mathrm{COUNT}(y) > (1-\varepsilon)(\alpha-1)2^{l-1}\right] > 1 - \frac{\delta}{\alpha - 1}$

Lemma 6: The estimate returned for COUNT(y) by Process 3 has a relative error of less than 7ε with probability at least 1−4δ.

Proof: Suppose the process uses level l for its estimate. Let B₁⊆S_l denote the set of all nodes b∈S_l such that the span of b is a subset of [0, y]. The estimator uses the sum of the estimated counts of all nodes in B₁. Let B₁^I⊆B₁ denote the internal nodes in B₁, and B₁^L⊆B₁ denote the leaves in B₁. Let COUN̂T(y) denote the estimate returned by the process. Let B₂⊆S_l denote the set of all buckets b∈S_l such that the span of b intersects [0, y], but is not a subset of [0, y].

COUN̂T(y)=EST(B₁)=EST(B₁^I)+EST(B₁^L)

A(B₁)≦COUNT(y)≦A(B₁)+A(B₂)  (1)

The following events are defined:

E₁ is A(B₂)≦(1+ε)(log y_max)2^l

E₂ is COUNT(y)≧(1−ε)2^{l−1}(α−1)

E₃ is EST(B₁^I)≧(1−ε)A(B₁^I)

E₄ is EST(B₁^L)≧(1−ε)A(B₁^L)

First consider E₁. Bounding A(B₂): each bucket in B₂ must span y. There can be no more than log y_max buckets in S_l that can span y, and each such bucket, except for the smallest bucket, must be an internal node of S_l. Let ƒ₂ denote the leaf bucket in B₂, if it exists (note that ƒ₂ may not exist). Let B₂′ denote the internal buckets in B₂. If ƒ₂ does not exist, then A(ƒ₂) is defined to be 0.

A(B₂)=A(ƒ₂)+A(B₂′)

From Lemma 3, E[A(B₂′)]=2^l|B₂′|, and

$\Pr\left[A(B_2') > (1+\varepsilon)2^{l}|B_2'|\right] < \frac{\delta}{2|B_2'|} \quad (2)$

Next, it is shown that A(ƒ₂) is small, with high probability. If ƒ₂ does not exist, then A(ƒ₂)=0, so it can be assumed that ƒ₂ exists. In such a case, ƒ₂ cannot be a singleton bucket, since then its span would have been completely contained within [0, y]. Consider I(ƒ₂), which is the number of stream items that need to be inserted into ƒ₂ to close the counter. From Lemma 1 and Chebyshev's inequality: Pr[|I(ƒ₂)−2^l|&gt;ε2^l]≦δ. Thus, with probability at least (1−δ), it must be true that I(ƒ₂)&lt;(1+ε)2^l. Since A(ƒ₂)≦I(ƒ₂), consequently:

Pr[A(ƒ₂)&gt;(1+ε)2^l]≦δ  (3)

Let E₅ denote the event A(B₂′)≦(1+ε)2^l|B₂′| and E₆ denote the event A(ƒ₂)≦(1+ε)2^l. From Equations 2 and 3:

$\Pr[\mathcal{E}_5 \wedge \mathcal{E}_6] \geq 1 - \Pr[\bar{\mathcal{E}}_5] - \Pr[\bar{\mathcal{E}}_6] \geq 1 - \delta - \frac{\delta}{2|B_2'|} \geq 1 - 1.5\delta$

Since |B₂′|&lt;log y_max, it follows that Pr[A(B₂)&gt;(1+ε)(log y_max)2^l]&lt;1.5δ. Thus:

Pr[E₁]&gt;1−1.5δ  (4)

From Lemma 5:

$\Pr[\mathcal{E}_2] \geq 1 - \frac{\delta}{\alpha - 1} \quad (5)$

Since B₁^I is a set of internal nodes in S_l, from Lemma 3,

$\Pr[\mathcal{E}_3] \geq \Pr\left[\mathrm{EST}(B_1^{I}) \geq \frac{A(B_1^{I})}{1+\varepsilon}\right] \geq 1 - \frac{\delta}{2},$

since EST(B₁^I)≧A(B₁^I)/(1+ε) implies EST(B₁^I)≧(1−ε)A(B₁^I). Thus:

$\Pr[\mathcal{E}_3] \geq 1 - \frac{\delta}{2} \quad (6)$

From Lemma 4,

Pr[E₄]&gt;1−δ  (7)

Suppose that E₁, E₂, E₃, E₄ were all true. From Equations 4-7, this is true with probability at least

$1 - \delta - 0.5\delta - 1.5\delta - \frac{\delta}{\alpha-1} \geq 1 - 4\delta.$

$\begin{aligned} \mathrm{COU\hat{N}T}(y) &= \mathrm{EST}(B_1^{I}) + \mathrm{EST}(B_1^{L}) \quad (8)\\ &\geq (1-\varepsilon)A(B_1^{I}) + (1-\varepsilon)A(B_1^{L}) \quad (9)\\ &= (1-\varepsilon)A(B_1) \quad (10) \end{aligned}$

$\mathrm{COUNT}(y) \leq A(B_1) + A(B_2) \leq A(B_1) + (1+\varepsilon)(\log y_{\max})2^{l},$

since E₁ is true. This is at most

${{A\left( B_{1} \right)} + {{{COUNT}(y)}\frac{\left( {1 + \varepsilon} \right)2^{l}\log \; y_{\max}}{\left( {1 - \varepsilon} \right)2^{l - 1}\left( {\alpha - 1} \right)}}},$

since E₂ is true. This is at most

${{A\left( B_{1} \right)} + {\frac{\varepsilon \left( {1 + \varepsilon} \right)}{2\left( {1 - \varepsilon} \right)}{{COUNT}(y)}}},$

using

$\alpha = \frac{4}{\varepsilon}(\log y_{\max} + 1).$

If ε≦½, then

$\frac{1}{1-\varepsilon} \leq 1 + 2\varepsilon,$

and thus

$\frac{1+\varepsilon}{1-\varepsilon} \leq (1+2\varepsilon)(1+\varepsilon) \leq 1+4\varepsilon.$ Hence:

$\mathrm{COUNT}(y) \leq A(B_1) + \frac{\varepsilon + 4\varepsilon^{2}}{2}\,\mathrm{COUNT}(y) \leq A(B_1) + \frac{3\varepsilon}{2}\,\mathrm{COUNT}(y),$

where ε≦½ was used.

$\mathrm{COUNT}(y) \leq \frac{A(B_1)}{1-1.5\varepsilon} \leq (1+6\varepsilon)A(B_1) \quad (11)$

From Equations 11 and 8:

$\mathrm{COU\hat{N}T}(y) \geq \frac{1-\varepsilon}{1+6\varepsilon}\,\mathrm{COUNT}(y) \geq (1-\varepsilon)(1-6\varepsilon)\,\mathrm{COUNT}(y) \geq (1-7\varepsilon)\,\mathrm{COUNT}(y)$

This proves that the estimate is within a 7ε relative error of COUNT(y) from below. The other direction is proved similarly, by noting the following.

$\begin{aligned} \mathrm{COU\hat{N}T}(y) &= \mathrm{EST}(B_1^{I}) + \mathrm{EST}(B_1^{L}) \\ &\leq \frac{A(B_1^{I})}{1-\varepsilon} + (1+\varepsilon)A(B_1^{L}) \\ &\leq (1+2\varepsilon)A(B_1^{I}) + (1+\varepsilon)A(B_1^{L}) \quad \text{using } \varepsilon \leq 1/2 \\ &\leq (1+2\varepsilon)A(B_1) \\ &\leq (1+2\varepsilon)\,\mathrm{COUNT}(y) \quad \text{using Equation 1} \end{aligned}$

Thus, with probability at least 1−4δ, the estimated count is within a 7ε relative error of the actual count.

Theorem 2: Processes 1-3 yield an (ε, δ)-approximation for the correlated basic count, using space (in bits)

$O\left(\frac{1}{\varepsilon}(\log^{2} U)\left(\log\log U + \log(1/\varepsilon) + \log(1/\delta)\right)\right).$

The space complexity is within a factor of O(log log U + log(1/ε) + log(1/δ)) of the optimal. The amortized processing time per record can be made O(log n). In the above expression, it is assumed that log y_max = Θ(log U).

Proof: The space complexity and correctness follow from Lemma 2 and Lemma 6, respectively. In prior work, it was shown that any data structure for basic counting for a (synchronous) sliding window with an ε relative error needs

$\Omega\left( \frac{\log^{2}U}{\varepsilon} \right)$

bits of space. Since basic counting over a synchronous sliding window is a special case of correlated basic count estimation, the space lower bound of

$\Omega\left( \frac{\log^{2}U}{\varepsilon} \right)$

holds here too. Hence, the achieved space bound is within a factor of O(log log U + log(1/ε) + log(1/δ)) of the optimal. To achieve O(log n) amortized processing time, observe that there are O(log n) data structures S₀, S₁, . . . , S_{l_max}, each containing O(ε⁻¹ log n) buckets, each containing a probabilistic counter. These probabilistic counters are packed into words of size Θ(log n), that is, the standard RAM model. A batch of O(ε⁻¹ log n) counts is processed together as follows. First, the batch is sorted in order of non-decreasing y-coordinate. This can be done in O(ε⁻¹ log n(log 1/ε + log log n)) time. Then, the following is done for each S_i, sequentially, so as to keep within the space bound. Data structure S_i is unpacked, putting each probabilistic counter into a word. This can be done in O(ε⁻¹ log² n) bits of space and O(ε⁻¹ log n) time. Further, the data structure is represented as a list in a pre-order traversal of the tree that S_i represents. Then, an embodiment walks through the list and updates the appropriate buckets. Since the updates are sorted in increasing y-value and the list is represented as a pre-order traversal, this can be done in O(ε⁻¹ log n) time. Hence, as an embodiment ranges over all S_i, the total processing time is O(ε⁻¹ log n(log 1/ε + log log n + log n))=O(ε⁻¹ log² n), or equivalently, the amortized update time is O(log n).

General Scheme for Correlated Aggregation

A general scheme for constructing a summary for correlated aggregation for any aggregation function that satisfies a certain set of properties is described. For any such aggregation function, it is described how to reduce the construction of a sketch for correlated aggregation to the construction of a sketch for aggregation over the entire stream. This reduction allows the use of previously known stream summaries for aggregation over streams in constructing summaries for correlated aggregation.

An embodiment employs this scheme to construct small space summaries for estimating correlated frequency moments over a data stream. For k&gt;0, the k-th frequency moment of a stream of identifiers, each assumed to be an integer from {1, . . . , m}, is defined as F_k=Σ_{i=1}^{m} ƒ_i^k, where ƒ_i is the number of occurrences of the i-th item. The estimation of the frequency moments over a data stream has been the subject of much study; however, embodiments provide processing for estimating the correlated frequency moments with provable guarantees on the relative error. The memory requirements are optimal up to small factors, namely factors that are logarithmic in m and the error probability δ, and factors that are polynomial in the relative error parameter ε.
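As a reference for the definition only (not the streaming process itself), the k-th frequency moment of a small in-memory stream can be computed exactly; this illustrative snippet uses linear space.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k = sum over items of (frequency ** k)."""
    return sum(f ** k for f in Counter(stream).values())

print(frequency_moment([1, 2, 2, 3, 3, 3], 2))  # 1 + 4 + 9 = 14
```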

The technique for general correlated aggregation described herein builds on prior work for the correlated estimation of the basic count. While the basic data structure used in prior work is an exact counter, embodiments rather employ a “sketch”, which can accurately estimate an aggregate on a stream, and provide a (possibly) probabilistic guarantee on the correctness. The error due to the randomized sketches employed by embodiments is analyzed, as well as the combination of different sketches. A central new issue is that of dependency between random variables.

Also presented in detail are approaches for the first space-efficient processing for correlated aggregates of the number of distinct elements (F₀), and other aggregates related to frequency moments, such as the F_k-heavy hitters and rarity. A technique for achieving lower amortized update time is provided. For correlated F₂, the update time is Õ(log n) per stream update.

General Streaming Models and Lower Bounds

The case of streams where items can have an associated positive or negative integer weight is described. Allowing negative weights is useful for analyzing the symmetric difference of two data sets, since the items in the first data set can be inserted into the stream with positive weights, while the items from the second data set can be inserted into the stream with negative weights.

In this model, each stream element is a 3-tuple (x_i, y_i, z_i), where z_i specifies the weight of the item. It is shown that, even if z_i∈{−1, 1} for all i, then for a general class of functions that includes the frequency moments, any summary that can estimate correlated aggregates accurately and that is constructed in a single pass must use memory that is linear in the size of the stream. This is to be contrasted with the estimation of frequency moments over a stream in the non-correlated case, where it is known how to estimate these aggregates in sublinear space even in the presence of positive and negative weights.

Also described is a model with arbitrary positive and negative weights in which multiple passes over the stream are allowed. This more general model allows the processing to make a small (but more than one) number of passes over the stream and store a small summary of what is seen. At a later time a query is made and must be answered using only the summary. Such a setting arises if data is collected and stored from previous days for the purpose of creating a summary, but is then deleted or archived while only the summary is retained.

In this case, a smooth pass-space tradeoff is obtained for these problems, in that with a logarithmic number of passes there are space-efficient algorithms for a large class of correlated aggregates even with negative weights, but with fewer passes no such space-efficient algorithms can exist.

Aggregation Over Asynchronous Streams

A closely related problem is that of computing aggregates over a sliding window on an asynchronous stream. In this scenario, there is a stream of (ν, t) tuples, where ν is a data item, and t is the timestamp at which it was generated. Due to asynchrony in the transmission medium, it is possible that the stream elements do not arrive in the order of the timestamps. In other words, it is possible that t₁&lt;t₂, but that (ν₁, t₁) is received later than (ν₂, t₂). This was not possible in the traditional definition of count-based or time-based sliding windows. There is a straightforward reduction from the problem of aggregating over asynchronous streams to that of computing correlated aggregates, and vice versa, from previous work. Hence, all of the results for correlated aggregates described herein apply also to aggregation over a sliding window on an asynchronous stream with essentially the same space and time bounds. Thus, embodiments achieve the first low-space solutions for asynchronous streams for a wide class of statistics.

Correlated Estimation of a Class of Statistics

Embodiments provide for the estimation of correlated aggregates for any aggregation function that satisfies a set of properties. Consider an aggregation function ƒ that takes as input a multi-set of real numbers R and returns a real number ƒ(R). In the following, the term “set of real numbers” is used to mean a “multi-set of real numbers”. Also, union of sets implies a multi-set union, when the context is clear.

For any set of tuples of real numbers T={(x_i, y_i) | 1≦i≦n} and real number c, let ƒ(T, c) denote the correlated aggregate ƒ({x_i | ((x_i, y_i)∈T) ∧ (y_i≦c)}). For any function ƒ satisfying the properties below, it is shown herein how to reduce the problem of space-efficient estimation of the correlated aggregate ƒ(T, c) to the problem of space-efficient estimation of ƒ in an uncorrelated manner on an entire set R.

The following definition of an (ε, δ) estimator is used herein. Definition 2: Given parameters ε, δ, where 0&lt;ε&lt;1 and 0&lt;δ&lt;1, an (ε, δ) estimator for a number Y is a random variable X such that, with probability at least 1−δ, the following is true:

(1−ε)Y≦X≦(1+ε)Y.

In the following description, the term “sketching function” denotes a compression of the input set with certain properties. More precisely, ƒ has a sketching function sk_ƒ(ν, γ, R) if:

Using sk_ƒ(ν, γ, R) it is possible to get a (ν, γ)-estimator of ƒ(R); and

For two sets R₁ and R₂, given sk_ƒ(ν, γ, R₁) and sk_ƒ(ν, γ, R₂), it is possible to compute sk_ƒ(ν, γ, R₁∪R₂).

Many functions ƒ have sketching functions. For instance, the second frequency moment F₂ and the k-th frequency moment F_k have sketches due to prior work. In these two examples, the sketching function is obtained by taking random linear combinations of the input.
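The sketch below illustrates this idea for F₂ in the style of the AMS sketch: each counter is a random ±1 linear combination of item frequencies, and two sketches built with the same rows combine by adding counters, which is exactly the composability required of a sketching function. It is a simplified illustration under stated assumptions: the hash-derived signs stand in for the 4-wise independent family used in the literature, and a plain mean is used where a median-of-means would give the formal (ν, γ) guarantee.

```python
import hashlib
import statistics

def sign(seed, row, x):
    """Pseudo-random +/-1 sign for item x in one sketch row (illustrative)."""
    h = hashlib.blake2b(f"{seed}:{row}:{x}".encode(), digest_size=1).digest()
    return 1 if h[0] & 1 else -1

class F2Sketch:
    def __init__(self, rows=64, seed=0):
        self.rows, self.seed = rows, seed
        self.z = [0] * rows          # z[j] = sum_i sign_j(i) * f_i

    def update(self, x, weight=1):
        for j in range(self.rows):
            self.z[j] += weight * sign(self.seed, j, x)

    def merge(self, other):
        """Sketch of the union of the two underlying multi-sets."""
        out = F2Sketch(self.rows, self.seed)
        out.z = [a + b for a, b in zip(self.z, other.z)]
        return out

    def estimate(self):
        return statistics.mean(v * v for v in self.z)  # E[z_j^2] = F2

s1, s2 = F2Sketch(seed=7), F2Sketch(seed=7)
for x in [1, 2, 2]:
    s1.update(x)
for x in [2, 3]:
    s2.update(x)
print(s1.merge(s2).estimate())  # approximates F2 of {1,2,2,2,3} = 11
```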

Embodiments require the following conditions on ƒ in order to construct a sketch for estimating correlated aggregates. These conditions intuitively correspond to “smoothness” conditions of the function ƒ, bounding how much ƒ can change when new elements are inserted into or deleted from the input multi-set. Informally, the less sensitive the function value is to small changes, the easier it is to apply ƒ to estimating correlated aggregates (a small numeric illustration follows the list below).

I. ƒ(R) is bounded by a polynomial in |R|.

II. For sets R₁ and R₂, ƒ(R₁∪R₂)≧ƒ(R₁)+ƒ(R₂).

III. There exists a function c₁^ƒ(·) such that for sets R₁, . . . , R_j, if ƒ(R_i)≦α for all i=1 . . . j, then ƒ(∪_{i=1}^{j}R_i)≦c₁^ƒ(j)·α.

IV. For ε&lt;1, there exists a function c₂^ƒ(ε) with the following property: for two sets A and B such that B⊂A, if ƒ(B)≦c₂^ƒ(ε)·ƒ(A), then ƒ(A−B)≧(1−ε)ƒ(A).

V. ƒ has a sketching function sk_ƒ(γ, ν, R), where γ∈(0, 1) and ν∈(0, 1).
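As a quick numeric sanity check (illustrative only, using the exact F₂ of small multi-sets), Properties II and III can be observed directly; here c₁^ƒ(j)=j² for ƒ=F₂, the form established for F_k in Lemma 12 below.

```python
from collections import Counter

def F2(multiset):
    return sum(f * f for f in Counter(multiset).values())

A = [1, 1, 2, 3]   # F2(A) = 4 + 1 + 1 = 6
B = [2, 2, 4]      # F2(B) = 4 + 1 = 5

# Property II (superadditivity under multi-set union):
assert F2(A + B) >= F2(A) + F2(B)          # 15 >= 11

# Property III with j = 2 and c1(j) = j**2 = 4:
assert F2(A + B) <= 4 * max(F2(A), F2(B))  # 15 <= 24
```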

For any function ƒ with a sketch sk_ƒ having the above properties, it is shown herein how to construct a sketch sk_ƒ^cor(ε, δ, T) for estimating the correlated aggregate ƒ(T, c) with the following properties:

A. Using sk_ƒ^cor(ε, δ, T), it is possible to get an (ε, δ)-estimator of ƒ(T, c) for any real c&gt;0.

B. For any tuple (x, y), using sk_ƒ^cor(ε, δ, T), it is possible to construct sk_ƒ^cor(ε, δ, T∪{(x, y)}).

Processing Description

Referring to FIG. 5, let ƒ_max denote an upper bound on the value of ƒ(·,·) over all input streams considered. The process (FIGS. 5-7) uses a set of levels l=0, 1, 2, . . . , l_max, where l_max is such that 2^{l_max}&gt;ƒ_max for input stream T and real number c. From Property I, it follows that l_max is logarithmic in the stream size.

Parameters α, γ, ν are chosen as follows:

$\alpha = \frac{64\, c_1^{f}(\log y_{\max})}{c_2^{f}(\varepsilon/2)}, \qquad \nu = \frac{\varepsilon}{2}, \qquad \gamma = \frac{\delta}{4 y_{\max}(l_{\max}+1)},$

where y_(max) is the largest possible y value.

Without loss of generality, assume that y_max is of the form 2^β−1 for some integer β. The dyadic intervals within [0, y_max] are defined inductively as follows (a short sketch after this definition illustrates how a query range decomposes over these intervals):

(1) [0, y_max] is a dyadic interval;

(2) If [a, b] is a dyadic interval and a≠b, then [a, (a+b−1)/2] and [(a+b+1)/2, b] are also dyadic intervals.
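The following illustrative helper (plain Python, again assuming y_max = 2^β−1) classifies the dyadic intervals against a query range [0, c], separating those fully contained in the range from those that straddle its right endpoint; the contained intervals play the role of the sets B₁ and the straddling ones the role of B₂ in the analysis below.

```python
def classify_dyadic(c, y_max):
    """Return (inside, straddle): dyadic intervals contained in [0, c]
    versus those intersecting [0, c] without being contained in it."""
    inside, straddle = [], []
    stack = [(0, y_max)]
    while stack:
        lo, hi = stack.pop()
        if hi <= c:
            inside.append((lo, hi))      # fully inside [0, c]
        elif lo > c:
            continue                     # disjoint from [0, c]
        else:
            straddle.append((lo, hi))    # straddles c: recurse on children
            if lo < hi:
                mid = (lo + hi - 1) // 2
                stack.extend([(lo, mid), (mid + 1, hi)])
    return inside, straddle

print(classify_dyadic(5, 7))
# ([(4, 5), (0, 3)], [(0, 7), (4, 7)]) -- at most one straddler per depth
```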

Within each level l, from 0 to l_max, there is a “bucket” for each dyadic interval within [0, y_max]. Thus, there are 2y_max−1 buckets in a single level. Each bucket b is a triple ⟨k(b), l(b), r(b)⟩, where [l(b), r(b)] is a dyadic interval that corresponds to the range of y values that this bucket is responsible for, and k(b) is defined below.

When a stream element (x, y) arrives, it is inserted into each level l=0, . . . , l_max. Within level l, it is inserted into exactly one bucket, as described in Process 5. For a bucket b in level l, let S(b) denote the (multi-)set of stream items that were inserted into b. Then, k(b)=sk_ƒ(ν, γ, S(b)) is a sketch of S(b).

Within each level, no more than α of the 2y_max−1 buckets are actually stored. In the process, S_l denotes the buckets that are stored in level l. The level S₀ is a special level that just consists of singletons. Among the buckets that are not stored, there are two types of buckets: those that were discarded in Process 5 (see the “Check for overflow” comment in Process 5), and those that were never used by the process. The above three types of buckets are termed “stored”, “discarded”, and “empty”, respectively. Note that S(b) is defined for each of these three types of buckets (if b is an empty bucket, then S(b) is defined as the empty set ∅).

The buckets in S_l are organized into a tree, induced by the relation between the dyadic intervals that these buckets correspond to.

The initialization for the process for a general function is described in Process 4, illustrated in FIG. 5. The update and query processing are described in Processes 5 and 6, illustrated in FIGS. 6(A-B) and 7, respectively.

Theorem 3 (Space Complexity): The space complexity of the sketch for correlated estimation is

$O\left(\frac{c_1^{f}(\log y_{\max}) \cdot (\log f_{\max})}{c_2^{f}(\varepsilon/2)} \cdot \mathit{len}\right),$

where

$\mathit{len} = \left|\mathrm{sk}_f\left(\frac{\varepsilon}{2}, \frac{\delta}{4 y_{\max}(2 + \log f_{\max})}, S\right)\right|$

is the number of bits needed to store the sketch.

Proof: There are no more than 2+log ƒ_max levels and in each level l, S_l stores no more than α buckets. Each bucket b contains sk_ƒ(ν, γ, S(b)). The space complexity is α(2+log ƒ_max) times the space complexity of sketch sk_ƒ. Here it is assumed that the space taken by sk_ƒ is larger than the space required to store l(b) and r(b).

Process Correctness

Let S denote the stream of tuples observed up to a given point. Suppose the required correlated aggregate is ƒ(S, c). Let A be the set

{x_i | ((x_i, y_i)∈S) ∧ (y_i≦c)}.

Thus, ƒ(S, c)=ƒ(A).

For level l, 0≦l≦l_max, B₁^l and B₂^l are defined as follows. Let B₁^l denote the set of buckets b in level l such that span(b)⊆[0, c]. Let B₂^l denote the set of buckets b in level l such that span(b)⊄[0, c], but span(b) has a non-empty intersection with [0, c].

Note that for each level l, B₁^l and B₂^l are uniquely determined once the query ƒ(S, c) is fixed. These do not depend on the actions of the process. This is an important property, as it allows the choice of which buckets to use during estimation to be independent of the randomness in the data structures. Further, note that only a subset of B₁^l and B₂^l is actually stored in S_l.

Consider any level l, 0≦l≦l_max. For bucket b, recall that S(b) denotes the set of stream items inserted into the bucket until the time of the query. For bucket b∈S_l, let ƒ(b) denote ƒ(S(b)). Let est_ƒ(b) denote the estimate of ƒ(b) obtained using the sketch k(b). If S(b)=∅, then ƒ(b)=0 and est_ƒ(b)=0. Thus note that ƒ(b) and est_ƒ(b) are defined no matter whether b is a stored, discarded, or empty bucket. Further, for a set of buckets B in the same level, let S(B)=∪_{b∈B}S(b), and let ƒ(B)=ƒ(S(B)). Let est_ƒ(B) be the estimate for ƒ(B) obtained through the composition of all sketches in ∪_{b∈B}k(b) (by Property V, sketches can be composed with each other).

Definition 3: Bucket b is defined to be “good” if (1−ν)ƒ(b)≦est_ƒ(b)≦(1+ν)ƒ(b). Otherwise, b is defined to be “bad”.

Let G denote the following event: each bucket b in each level 0 . . . l_max is good.

Lemma 7:

${\Pr \lbrack G\rbrack} \geq {1 - \frac{\delta}{2}}$

Proof: For each bucket b, note that est_ƒ(b) is a (ν, γ)-estimator for ƒ(b). Thus, the probability that b is bad is no more than γ. Noting that there are fewer than 2y_max buckets in each level, and l_max+1 levels in total, and applying a union bound yields:

$\Pr[\bar{G}] \leq 2 y_{\max}(l_{\max}+1)\gamma = \frac{\delta}{2}$

Lemma 8: For any level l, S(B₁^l)⊆A⊆S(B₁^l∪B₂^l).

Proof: Every bucket b∈B₁^l must satisfy span(b)⊆[0, c]. Thus every element inserted into B₁^l must belong to A. Hence S(B₁^l)⊆A. Each element in A has been inserted into some bucket in level l (it is possible that some of these buckets have been discarded). By the definitions of B₁^l and B₂^l, an element in A cannot be inserted into any bucket outside of B₁^l∪B₂^l. Thus A⊆S(B₁^l∪B₂^l).

Using Lemma 8 and Condition II on ƒ yields the following for any level l:

ƒ(B₁^l)≦ƒ(A)≦ƒ(B₁^l∪B₂^l).  (12)

Lemma 9: Conditioned on event G, Process 6 does not output FAIL in step 1.

Proof: Consider l_max. Y_{l_max}&gt;c if event G occurs. Observe that Y_{l_max} is initialized to ∞ in Process 4. Its value can only change if the root b of S_{l_max} closes. For this to happen, there must be est(k(b))≧2^{l_max+1}. But 2^{l_max+1}&gt;2ƒ_max, which means that est(k(b)) does not provide a (1+ε)-approximation. This contradicts the occurrence of event G. Hence, Y_{l_max}&gt;c and so Process 6 does not output FAIL in step 1.

Let l* denote the level used by Process 6 to answer the query ƒ(S, c).

Lemma 10: If l*≧1 and G is true, then

ƒ(B₂^{l*})≦c₁^ƒ(log y_max)2^{l*+2}.

Proof: First, note that there can be no singleton buckets in B₂^{l*}, by the definition of B₂^l for a level l. Thus, for each bucket b∈B₂^{l*}, est_ƒ(b)≦2^{l*+1}. Because G is true, every bucket b∈B₂^{l*} is good, so that

$f(b) \leq \frac{2^{l^{*}+1}}{1-\nu}.$

Next, note that there are no more than log y_max buckets in B₂^{l*}, since there can be only one dyadic interval of a given size that intersects [0, c] but is not completely contained within [0, c].

From Property III:

${f\left( B_{2}^{l*} \right)} = {{f\left( {\bigcup_{b \in B_{2}^{l^{*}}}{S(b)}} \right)} \leq {{c_{1}^{f}\left( {\log \; y_{m\; {ax}}} \right)} \cdot {\frac{2^{l^{*} + 1}}{1 - v}.}}}$

Since ν≦½, the desired result is obtained.

Lemma 11: If l*≧1 and G is true, then:

ƒ(A)≧α2^(l*−4).

Proof: Since the process used level l* for answering the query, it must be the case that there are buckets in S_{l*−1} that had an intersection with [0, c] but were discarded from the data structure. It follows that there are at most log y_max buckets b∈S_{l*−1} such that span(b)⊄[0, c]. For the remaining buckets b∈S_{l*−1}, it must be true that span(b)⊆[0, c]. If S_{l*−1} is viewed as a binary tree with α nodes, according to the ordering between the different dyadic intervals, then S_{l*−1} must have (α−1)/2 internal nodes. Let I denote the set of buckets b∈S_{l*−1} such that b is an internal node and span(b)⊆[0, c]. Thus |I|≧(α−1)/2−log y_max. Since G is true, for any bucket

$b \in I, \quad f(b) \geq \frac{2^{l^{*}-1}}{1+\nu}.$

Using property II repeatedly yields:

${f(A)} \geq {f(I)} \geq \frac{{I}2^{l^{*} - 1}}{1 + v}$

Using ν&lt;1, and for an appropriately large value of α, ((α−1)/2−log y_max)≧α/4. Combining the above yields the following:

$f(A) \geq \frac{\alpha\, 2^{l^{*}}}{2 \cdot 4 \cdot 2} = 2^{l^{*}-4}\alpha.$

Theorem 4: When presented with a query for ƒ(S, c), let est denote the estimate returned by the process. Then, with probability at least 1−δ:

(1−ε)ƒ(S, c)≦est≦(1+ε)ƒ(S, c).

Proof: If l*=0, then all elements (x, y)∈S such that y≦c are stored in S₀. In this case, the theorem follows from the definition of event G. Otherwise, est=est_ƒ(B₁^{l*}), and ƒ(S, c)=ƒ(A). First note that in level l*, none of the buckets in B₁^{l*} has been discarded. Thus each bucket b∈B₁^{l*} is either empty or stored. Thus, it is possible to execute line 7 in Process 6 correctly to construct a sketch of S(B₁^{l*}). The composition property of sketching functions then yields a sketch sk(ν, γ, S(B₁^{l*})).

Let E₁ denote the event (1−ν)ƒ(B₁^{l*})≦est_ƒ(B₁^{l*})≦(1+ν)ƒ(B₁^{l*}). Thus:

Pr[E₁]≧1−γ

The following is conditioned on both E₁ and G occurring. From Equation 12: ƒ(A)≦ƒ(B₁^{l*}∪B₂^{l*}). From Lemmas 10 and 11:

$\frac{f(B_2^{l^{*}})}{f(A)} \leq \frac{c_1^{f}(\log y_{\max})\, 2^{l^{*}+2}}{\alpha\, 2^{l^{*}-4}} \leq \frac{c_1^{f}(\log y_{\max})\, 2^{6}}{\alpha} \leq c_2^{f}\left(\frac{\varepsilon}{2}\right)$

where the value of α has been substituted.

Since (A−B₁^{l*})⊆B₂^{l*} (writing B₁^{l*} and B₂^{l*} for the sets of items inserted into those buckets), the following is true:

$\frac{f\left(A - B_1^{l^{*}}\right)}{f(A)} \leq c_2^{f}\left(\frac{\varepsilon}{2}\right).$

Using Property IV yields:

${f\left( B_{1}^{l^{*}} \right)} = {{f\left( {A - \left( {A - B_{1}^{l^{*}}} \right)} \right)} \geq {\left( {1 - \frac{\varepsilon}{2}} \right){f(A)}}}$

Conditioned on E₁ and G both being true:

est_ƒ(B₁^{l*})≧(1−ε/2)(1−ν)ƒ(A)≧(1−ε)ƒ(A)

This proves that conditioned on G and E₁, the estimate returned is never too small. For the other direction, note that conditioned on E₁ being true:

est_ƒ(B₁^{l*})≦(1+ν)ƒ(B₁^{l*})≦(1+ν)ƒ(A)≦(1+ε)ƒ(A)

where ƒ(B₁^{l*})≦ƒ(A) and ν&lt;ε have been used.

To complete the proof of the theorem, note that:

$\Pr[G \wedge E_1] \geq 1 - \Pr[\bar{G}] - \Pr[\bar{E}_1] \geq 1 - \frac{\delta}{2} - \gamma \geq 1 - \delta,$

using Lemma 7 and γ≦δ/2.

Frequency Moments F_(k)

The general technique presented herein can yield a data structure for the correlated estimation of the frequency moments F_k, k≧2.

Fact 1: (Hölder's Inequality) For vectors a and b of the same dimension, and any integer k≧1, ⟨a, b⟩≦∥a∥_k·∥b∥_{k/(k−1)}.

Lemma 12: For sets S_i, i=1 . . . j, if F_k(S_i)≦β for each i=1 . . . j, then F_k(∪_{i=1}^{j}S_i)≦j^k β.

Proof: Fact 1, applied to j-dimensional vectors a and b, implies that |⟨a, b⟩|^k≦∥a∥_k^k·∥b∥_{k/(k−1)}^k. Setting b=(1, 1, . . . , 1), it follows that (a₁+ . . . +a_j)^k≦j^{k−1}(a₁^k+ . . . +a_j^k). Hence, it follows that

$F_k\Big(\bigcup_{i=1}^{j} S_i\Big) \leq j^{k-1}\sum_{i=1}^{j} F_k(S_i) \leq j^{k}\beta.$

Lemma 13: If F_k(B)≦(ε/(3k))^k F_k(A), then F_k(A∪B)≦(1+ε)F_k(A).

Proof: Suppose A and B have support on {1, 2, . . . , n}. Let a and b be the characteristic vectors of sets A and B, respectively. Using Fact 1 yields:

$\begin{aligned}
F_k(A \cup B) &= \sum_{i=1}^{n}(a_i + b_i)^k \\
&= F_k(A) + F_k(B) + \sum_{i=1}^{n}\sum_{j=1}^{k-1}\binom{k}{j} a_i^{j} b_i^{k-j} \\
&= F_k(A) + F_k(B) + \sum_{j=1}^{k-1}\binom{k}{j}\sum_{i=1}^{n} a_i^{j} b_i^{k-j} \\
&\leq F_k(A) + F_k(B) + \sum_{j=1}^{k-1}\binom{k}{j}\Big(\sum_{i=1}^{n}(a_i^{j})^{k/j}\Big)^{j/k}\Big(\sum_{i=1}^{n}(b_i^{k-j})^{k/(k-j)}\Big)^{(k-j)/k} \\
&= F_k(A) + F_k(B) + \sum_{j=1}^{k-1}\binom{k}{j} F_k(A)^{j/k} F_k(B)^{(k-j)/k} \\
&\leq F_k(A) + F_k(B) + \sum_{j=1}^{k-1}\binom{k}{j} F_k(A)\left(\frac{\varepsilon}{3k}\right)^{k-j} \\
&\leq (1+\varepsilon/3)F_k(A) + F_k(A)\sum_{j=1}^{k-1}\binom{k}{j}\left(\frac{\varepsilon}{3k}\right)^{k-j} \\
&\leq (1+\varepsilon/3)F_k(A) + F_k(A)\left(1+\frac{\varepsilon}{3k}\right)^{k} - F_k(A) \\
&\leq (1+\varepsilon/3)F_k(A) + F_k(A)(1+2\varepsilon/3) - F_k(A) \\
&\leq (1+\varepsilon)F_k(A),
\end{aligned}$

where (1+x)^y≦e^{xy} was used for all x and y, and e^z≦1+2z for z≦½. This completes the proof.

Lemma 14: If C⊂D, and F_k(C)≦(ε/(9k))^k F_k(D), then F_k(D−C)≧(1−ε)F_k(D).

Proof: It is known that for any two sets A and B, F_k(A∪B)≦2^k(F_k(A)+F_k(B)). Thus

F_k(D)=F_k((D−C)∪C)≦2^k(F_k(D−C)+F_k(C)), which leads to:

F_k(D−C)≧F_k(D)/2^k−F_k(C)≧((9k/ε)^k(1/2^k)−1)F_k(C)≧(3k/ε)^k F_k(C). Thus, F_k(C)≦(ε/(3k))^k F_k(D−C). Applying Lemma 13 yields

F_k(C∪(D−C))≦(1+ε)F_k(D−C). Thus,

F_k(D−C)≧F_k(D)/(1+ε)≧(1−ε)F_k(D).

Theorem 5: For parameters 0&lt;ε&lt;1 and 0&lt;δ&lt;1, there is a sketch for an (ε, δ)-estimation of the correlated aggregate F_k on a stream of tuples of total length n, using space n^{1−2/k}·poly(ε⁻¹ log(n/δ)).

Proof: From Lemma 12, c₁^{F_k}(j)=j^k. From Lemma 14, c₂^{F_k}(ε)=(ε/(9k))^k. Using these in Theorem 3 yields c₁^{F_k}(log y_max)=(log y_max)^k and c₂^{F_k}(ε/2)=(ε/(18k))^k. Using the sketches for F₂ and F_k, k&gt;2, from prior work, the above result is obtained.

The space can be improved to r^{1−2/k}·poly(ε⁻¹ log(n/δ)), where r is the number of distinct x_i-values in the stream. In the worst case, though, r could be Θ(n). The dependence is made more explicit for the case of F₂.

Lemma 15: For parameters 0&lt;ε&lt;1 and 0&lt;δ&lt;1, there is a sketch for (ε, δ)-error correlated estimation of F₂ on a stream of tuples of total length n, using space O(ε⁻⁴(log(1/δ)+log y_max)(log² y_max)(log² ƒ_max)) bits. The amortized update time is O(log ƒ_max·log y_max).

Proof: The space taken by a sketch for an (ε, δ) estimator for F₂ on a stream is O((log ƒ_max)(1/ε²)log(1/δ)) bits. From the proof of Theorem 5, c₁^{F₂}(j)=j², and c₂^{F₂}(ε)=(ε/18)².

Using the above in Theorem 3, the space is O(ε⁻⁴ log² ƒ_max log² y_max(log 1/δ+log y_max)) bits.

To get O(log ƒ_max(log 1/δ+log y_max)) amortized processing time, observe that there are O(log ƒ_max) data structures S_i, each containing O(ε⁻² log² y_max) buckets, each holding a sketch of O(ε⁻² log ƒ_max(log 1/δ+log y_max)) bits. A batch of O(ε⁻² log² y_max) updates is processed at once. The batch is first sorted in order of non-decreasing y-coordinate. This can be done in O(ε⁻² log² y_max(log 1/ε+log log y_max)) time. Then, the following processing is done for each S_i. A pre-order traversal of the buckets in S_i is conducted and the appropriate buckets are updated. Importantly, each bucket maintains an update-efficient AMS sketch (shown in prior work), which can be updated in time O(log 1/δ+log y_max). Since the updates are sorted in increasing y-value and the list is represented as a pre-order traversal, the total time to update S_i is O(ε⁻² log² y_max(log 1/δ+log y_max)). The time to update all the S_i is O(log ƒ_max) times this. So the amortized time is O(log ƒ_max(log 1/δ+log y_max)).

Other Useful Statistics

While many aggregation functions satisfy the properties described above, some important ones do not. However, in many important remaining cases, these aggregation functions are related to aggregation functions that do satisfy these properties, and the mere fact that they are related in the appropriate way enables efficient estimation of the corresponding correlated aggregate.

The correlated F₂-heavy hitters can be computed, as well as the rarity (defined below), by relating these quantities to F₂ and F₀, respectively.

For example, consider the correlated F₂-heavy hitters problem with y-bound c and parameters ε, φ, 0<ε<φ<1. Letting F₂(c) denote the correlated F₂-aggregate with y-bound c, the goal is to return all x for which |{(x_(i), y_(i)) | x_(i)=x and y_(i)≦c}|²≧φF₂(c), and no x for which |{(x_(i), y_(i)) | x_(i)=x and y_(i)≦c}|²≦(φ−ε)F₂(c). To do this, the same data structures S_(i) may be used as for estimating the correlated aggregate F₂. However, for each S_(i) and each bucket in S_(i), a process is maintained for estimating the squared frequency of each item inserted into the bucket, up to an additive (ε/10)·2^(i). To estimate the correlated F₂-heavy hitters, for each item an additive (ε/10)·F₂(c) approximation to its squared frequency is obtained by summing up the estimates provided for it over the different buckets contained in [0, c] in the data structure S_(i) used for estimating F₂(c). Since only an ε/10 fraction of F₂(c) does not occur in such buckets, the list of all heavy hitters is obtained this way, with no spurious ones.
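A sketch of this post-processing, assuming a Misra-Gries estimator as the per-bucket frequency-estimation process (the text does not pin down a particular one) and representing the relevant buckets of S_(i) as (lo, hi, estimator) triples:

```python
class MisraGries:
    """Misra-Gries frequency estimator with k counters: it undercounts
    each item's frequency by at most (number of inserts)/k. Used here
    as an assumed stand-in for the per-bucket estimation process."""
    def __init__(self, k):
        self.k = k
        self.counts = {}

    def update(self, x):
        if x in self.counts:
            self.counts[x] += 1
        elif len(self.counts) < self.k:
            self.counts[x] = 1
        else:                                # decrement all, drop zeros
            for key in list(self.counts):
                self.counts[key] -= 1
                if self.counts[key] == 0:
                    del self.counts[key]

    def estimate(self, x):
        return self.counts.get(x, 0)

def correlated_f2_heavy_hitters(buckets, c, phi, f2_of_c):
    """Report x whose estimated squared frequency among tuples with
    y <= c reaches phi * F2(c). `buckets` holds (lo, hi, MisraGries)
    triples from the structure S_i used to estimate F2(c)."""
    freq = {}
    for lo, hi, mg in buckets:
        if hi <= c:                          # bucket lies inside [0, c]
            for x in mg.counts:
                freq[x] = freq.get(x, 0) + mg.estimate(x)
    return [x for x, f in freq.items() if f * f >= phi * f2_of_c]
```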

In the rarity problem, the goal is to estimate the fraction of distinct items that occur exactly once in the multi-set. The ideas for estimating rarity are similar: the same data structures S_(i) are maintained as for estimating the correlated aggregate F₀, but each bucket maintains data structures for estimating the rarity of the items inserted into that bucket.
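For reference, the statistic being approximated is simple to state exactly; the following few lines compute the rarity of a small multi-set directly:

```python
from collections import Counter

def rarity(items):
    """Fraction of distinct items occurring exactly once in the multi-set."""
    counts = Counter(items)
    if not counts:
        return 0.0
    once = sum(1 for c in counts.values() if c == 1)
    return once / len(counts)

print(rarity([1, 2, 2, 3, 4, 4, 4]))   # 2 of 4 distinct items are unique -> 0.5
```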

Number of Distinct Elements

The number of distinct elements in a stream, also known as the zeroth frequency moment F₀, is a fundamental and widely studied statistic of the stream. Here, the correlated estimation of the number of distinct elements in a stream is described. Consider a stream of (x, y) tuples, where x∈{1, . . . , m} and y∈{1, . . . , y_(max)}. The process employed by an embodiment is an adaptation of the process for estimating the number of distinct elements within a sliding window of a data stream, due to prior work. Similar to prior work, the process for correlated estimation of F₀ is based on “distinct sampling”, or sampling based on the hash values of the item identifiers. Multiple samples, S₀, S₁, . . . , S_(k), are maintained, where k=log m. Suppose, for simplicity, that there is a hash function h that maps elements in {1, . . . , m} to the real interval [0, 1]. The assumption of such a powerful hash function can be removed, as shown in prior work.

The process of prior work proceeds as follows. Stream items are placed in the samples S_(i) in the following manner. (A) Each item (x, y) is placed in S₀. (B) For i>0, an item (x, y) is placed in level i if

${h(x)}<={\frac{1}{2^{i}}.}$

Note that if an item x is placed in level i, it must have been placed in level i−1 also.

Since each level has a limited space budget, say α, a way to discard elements from each level is needed. The process according to an embodiment differs from prior work in how elements are discarded from each level. For correlated aggregates, S_(i) retains only those items (x, y) that (1) have an x value that is sampled into S_(i), and (2) have the smallest values of y among all the elements sampled into S_(i). In other words, each level is a priority queue using the y values as the weights, whereas in prior work each level was a simple FIFO (first-in-first-out) queue.

It can be shown that the above scheme of retaining those elements with a smaller value of y, when combined with the sampling scheme in prior work, yields an (ε, δ)-estimator for the correlated distinct counts.
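The following Python sketch illustrates the level structure with this priority-queue retention rule. It assumes an idealized hash into [0, 1) (simulated here with SHA-256) and answers queries of the form "distinct x with y≦y₀", matching the retention of the smallest y values and the y-bound convention used elsewhere in this document; a level is trusted for a query only if nothing relevant to that query can have been evicted from it. This is a minimal sketch, not the full estimator.

```python
import hashlib

def h(x):
    """Deterministic stand-in for the idealized hash into [0, 1)."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], 'big') / 2.0 ** 64

class Level:
    """Sample S_i: x is sampled iff h(x) <= 2^-i; among sampled items,
    at most alpha entries are kept, one per x with its smallest y, and
    the entry with the LARGEST y is evicted on overflow."""
    def __init__(self, i, alpha):
        self.i, self.alpha = i, alpha
        self.min_y = {}                          # x -> smallest y seen

    def offer(self, x, y):
        if h(x) > 2.0 ** -self.i:
            return
        if x not in self.min_y or y < self.min_y[x]:
            self.min_y[x] = y
        if len(self.min_y) > self.alpha:
            worst = max(self.min_y, key=self.min_y.get)
            del self.min_y[worst]

def update(levels, x, y):
    for level in levels:
        level.offer(x, y)

def estimate_distinct(levels, y0):
    """Estimate |{distinct x : some (x, y) with y <= y0}|. A level is
    trusted only if it cannot have evicted anything relevant: evictions
    always remove the current largest y, so if y0 is at most the largest
    retained y (or the level never filled), nothing with y <= y0 was lost."""
    for level in levels:                         # i = 0, 1, 2, ...
        if len(level.min_y) < level.alpha or y0 <= max(level.min_y.values()):
            n = sum(1 for v in level.min_y.values() if v <= y0)
            return n * 2 ** level.i
    return 0

levels = [Level(i, alpha=64) for i in range(20)]
for x in range(1000):
    update(levels, x, y=x % 100)
print(estimate_distinct(levels, y0=9))           # noisy estimate of 100
```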

Theorem 6: Given parameters 0<ε<1 and 0<δ<1, there is a streaming process that can maintain a summary of a stream of tuples (x, y), where x∈{1, . . . , m} and y∈{1, . . . , y_(max)}, such that: (1) The space of the summary is

$\begin{matrix}{{{O\left( {{\log \; m} + {\log \; y_{{ma}\; x}}} \right)}\; \frac{\log \; m}{\varepsilon^{2}}\log \; {1/\delta}\mspace{14mu} {bits}};} & (2)\end{matrix}$

The summary can be updated online as stream elements arrive; and (3)Given a query y₀, the summary can return, an (ε, δ)-estimator of thenumber of distinct x values among all tuples (x, y) with y valuessatisfying y≧y₀.

Example Implementations

Correlated F₂, Second Frequency Moment

An example process for correlated F₂ estimation was implemented in Python. The following four data sets were used for example evaluations. (1) Packet traces of Ethernet traffic on a LAN and a WAN. The relevant data here is the number of bytes in the packet and the timestamp on the packet (in milliseconds). This data set is referred to as the “Ethernet” data set. It has 2 million packets, and was constructed by taking two packet traces and combining them by interleaving. (2) The Uniform data set, which is a sequence of tuples (x, y) where x is generated uniformly at random from the set {0, . . . , 10000} and y is generated uniformly at random from the set {0, . . . , 2¹³−1}. The maximum size of this data set is 10⁷. (3) The Zipfian data set, with α=1. Here the x values are generated according to the Zipfian distribution with α=1, from the domain {0, . . . , 10000}, and the y values are generated uniformly at random from the set {0, . . . , 2¹³−1}. The maximum size of this data set is 10⁷. (4) The Zipfian data set as described above, with α set to 2.
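As an illustration of how the synthetic data sets can be generated (sizes reduced here; the generator below is an assumed reconstruction for illustration, not the evaluation code itself):

```python
import bisect, random

def zipf_sampler(alpha, domain, rng):
    """Finite-domain Zipfian sampler: P(x) proportional to 1/(x+1)^alpha
    over {0, ..., domain}, via inversion of the cumulative weights."""
    cum, total = [], 0.0
    for i in range(domain + 1):
        total += 1.0 / (i + 1) ** alpha
        cum.append(total)
    return lambda: bisect.bisect_left(cum, rng.random() * total)

rng = random.Random(1)
y_max = 2 ** 13 - 1
size = 10 ** 5                                   # reduced from 10^7

# Uniform data set: x and y both uniform.
uniform = [(rng.randint(0, 10000), rng.randint(0, y_max)) for _ in range(size)]

# Zipfian data set (alpha = 1): skewed x, uniform y.
sample_x = zipf_sampler(1.0, 10000, rng)
zipfian = [(sample_x(), rng.randint(0, y_max)) for _ in range(size)]
```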

For the sketch for F₂, the implementation used a variant of a process developed in prior work. It was observed that, for all cases tested, the relative error of the process was almost always within the desired approximation error ε, for δ<0.2.

Space Usage as a Function of ε

The space consumption of the process was studied. The space depends on a number of factors, including the values of ε and δ. The space also depends on ƒ_(max), since ƒ_(max) determines the maximum size of the data structure at each level. The most direct dependence, however, is on ε, and this dependence is explored further.

In FIG. 8, the space taken for the summary for F₂ is plotted as a function of ε. This is illustrated for all the data sets described above, with each data set of size 2 million tuples. The space taken by the sketch increases rapidly with decreasing ε, and the rate of growth is similar for all four data sets. For smaller values of ε (ε≦0.1), the summary data structure took even more space than the input stream itself. This is not surprising, since the space increases as the fourth power of 1/ε.

It can be observed that the space taken for the Ethernet data set is greater than the space for the Uniform and Zipfian data sets. The reason is that the range of the y values in the Ethernet data set was much larger (0 to 3200000) than in the Uniform and Zipfian data sets, where the y values ranged from 0 to 8191. The larger range of y values meant that the number of nodes at each level is larger. In addition, a larger value of y_(max) also leads to a larger value of ƒ_(max) for the Ethernet data set than for the other data sets, and hence also to a larger number of levels of data structures. However, the rate of growth of the space with respect to ε remains similar for all data sets.

As described herein, the space savings of the sketch improve as the stream size increases.

Space Usage as a Function of the Stream Size

Example results are illustrated in FIG. 9 (for ε=0.15), FIG. 10 (for ε=0.2), and FIG. 11 (for ε=0.25). In all cases, as predicted by theory, the space taken by the sketch does not change much, increasing only slightly as the stream size increases. This shows that the space savings of the process are much larger for streams that are larger in size.

The time required for processing the stream of 2 million elements was nearly the same (about 5 minutes) for the Uniform, Zipfian, and Ethernet data sets. The processing rate can be improved by using C/C++ and a more optimized implementation.

These example results illustrate that a reasonable processing rate can be achieved for the F₂ data structure, and that correlated query processing is indeed practical and provides significant space savings, especially for large data streams (on the order of 10 million tuples or larger).

Correlated F₀, Number of Distinct Elements

The process for F₀ was also implemented in Python and evaluated on the four data sets described for F₂. The only difference was that in the Uniform and Zipfian data sets, the range of x values was much larger (0 . . . 1000000) than in the case of F₂, where the range of x values was 0 . . . 10000. The reason for this change is that there are much simpler approaches for correlated F₀ estimation when the domain size is small: simply maintain the list of all distinct elements seen so far along the x dimension, along with the smallest y value associated with each. Note that such a simplification is not (easily) possible for the case of F₂.
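A minimal sketch of that simple baseline for small x-domains — exact, and using space linear in the number of distinct x values:

```python
def exact_correlated_f0(stream):
    """Exact correlated distinct count for small x-domains: keep, per x,
    the smallest y seen; a query y0 counts entries with min-y <= y0."""
    min_y = {}
    for x, y in stream:
        if x not in min_y or y < min_y[x]:
            min_y[x] = y
    return lambda y0: sum(1 for v in min_y.values() if v <= y0)

query = exact_correlated_f0([(1, 5), (2, 3), (1, 2), (3, 9)])
print(query(4))   # -> 2 (x=1 via y=2, and x=2 via y=3)
```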

The variation of the sketch size with ε is shown in FIG. 12. While the sketch size decreases with increasing ε, the rate of decrease was not as fast as in the case of F₂. Further, the sketch size for comparable values of ε was much smaller than the sketch for correlated F₂. Another point is that the space taken by the sketch for the Ethernet data set was significantly smaller than the sketch for the other data sets. This is because the range of x values in the Ethernet data set was much smaller (0 . . . 2000) than for the other data sets (0 . . . 1000000), and the number of levels in the data structure is proportional to the logarithm of the number of possible values along the x dimension. As explained above, the process of choice for correlated F₀ estimation for Ethernet-type data sets (where the x range is small) will be different. Thus, the process is particularly useful for data sets where the x range is much larger.

The size of the sketch as a function of the stream size is shown in FIG. 13, for ε=1. The sketch size hardly changed with the stream size. Note, however, that for much smaller streams the sketch will be smaller, since some of the data structures at different levels have not yet reached their maximum size. The results for other values of ε are similar and are not shown here.

Accordingly, embodiments provide for computing correlated aggregates over a data stream. Referring to FIG. 14, it will be readily understood that embodiments may be implemented using any of a wide variety of devices or combinations of devices. An example device that may be used in implementing embodiments includes a computing device in the form of a computer 1410. In this regard, the computer 1410 may execute program instructions configured to compute correlated aggregates over a data stream, and perform other functionality of the embodiments, as described herein.

Components of computer 1410 may include, but are not limited to, at least one processing unit 1420, a system memory 1430, and a system bus 1422 that couples various system components including the system memory 1430 to the processing unit(s) 1420. The computer 1410 may include or have access to a variety of computer readable media. The system memory 1430 may include computer readable storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) and/or random access memory (RAM). By way of example, and not limitation, system memory 1430 may also include an operating system, application programs, other program modules, and program data.

A user can interface with (for example, enter commands and information into) the computer 1410 through input devices 1440. A monitor or other type of device can also be connected to the system bus 1422 via an interface, such as an output interface 1450. In addition to a monitor, computers may also include other peripheral output devices. The computer 1410 may operate in a networked or distributed environment using logical connections (network interface 1460) to other remote computers or databases (remote device(s) 1470). The logical connections may include a network, such as a local area network (LAN) or a wide area network (WAN), but may also include other networks/buses.

As will be appreciated by one skilled in the art, aspects may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, et cetera) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in at least one computer readable medium having computer readable program code embodied thereon.

Any combination of at least one computer readable medium may be utilized. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having at least one wire, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible or non-signal medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, et cetera, or any suitable combination of the foregoing.

Computer program code for carrying out operations for embodiments may be written in any combination of at least one programming language, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Embodiments are described with reference to figures of methods, apparatus (systems) and computer program products according to embodiments. It will be understood that portions of the figures can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified. The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified.

This disclosure has been presented for purposes of illustration and description but is not intended to be exhaustive or limiting. Many modifications and variations will be apparent to those of ordinary skill in the art. The example embodiments were chosen and described in order to explain principles and practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.

Although illustrated example embodiments have been described herein with reference to the accompanying drawings, it is to be understood that embodiments are not limited to those precise example embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the disclosure.

What is claimed is:
1. A method for summarizing attributes of data elements of a data set for computing correlated aggregates, comprising: receiving a stream of data elements at a device, each data element having at least one numerical attribute; maintaining in memory a plurality of tree structures comprising a plurality of separate nodes for summarizing numerical attributes of the data elements with respect to a predicate value of a correlated aggregation query, said maintaining comprising: creating the plurality of tree structures in which each node implements one of: a probabilistic counter and a sketch, wherein said probabilistic counter and said sketch each act to estimate aggregated data element numerical attributes to form a summary of said numerical attributes; and responsive to a correlated aggregation query specifying said predicate value, using said plurality of tree structures as a summary of said data element numerical attributes to compute a response to said correlated aggregate query.
2. The method of claim 1, wherein said summary of said data element numerical attributes has a guaranteed relative error.
3. The method of claim 2, wherein a memory size requirement for maintaining the summary of said data element numerical attributes is a function of the relative error.
4. The method of claim 1, wherein said summary provides an estimate that is provably close to a correct answer to said correlated aggregation query.
5. The method of claim 1, wherein each node implements a probabilistic counter, and further wherein said correlated aggregation query is a count query with respect to said predicate value.
6. The method of claim 5, wherein nodes of said plurality of tree structures are organized into levels of probabilistic counters.
7. The method of claim 6, wherein each level of nodes provides a more refined interval for counting said numerical attributes of said data elements.
8. The method of claim 7, wherein each node is capped to a predetermined size, and further wherein, responsive to the predetermined size being reached, two child nodes are formed for a parent node.
9. The method of claim 1, wherein each node implements a sketch function suitable for a statistical inquiry specified in said correlated aggregation query.
10. The method of claim 9, wherein said statistical inquiry is computation of a frequency moment F_(k) of said data stream.
11. The method of claim 1, wherein said stream of data elements comprises a stream of IP packets, wherein said data elements comprise IP packets, and further wherein said numerical attributes comprise sizes of said IP packets.